I am looking forward to getting more first-hand experience with R and with data visualization and analysis.
My GitHub:
https://github.com/AleksanKo/IODS-project
I have completed the data wrangling exercises: I read the data from a file with the read.csv() function, created a new dataset by selecting columns from the old one, and wrote it to a new file with write.csv().
lrn2014 <- read.csv('data/lrn2014.csv')
str(lrn2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
dim(lrn2014)
## [1] 166 7
The dataset consists of 166 observations and 7 variables. There is only one factor variable (gender); the others are integer or numeric. attitude measures global attitude towards statistics and points are exam points. deep stands for the deep approach to learning, while surf and stra stand for the surface and strategic approaches respectively.
The data wrangling exercises for this chapter are also completed: again, the data was read with read.csv(), a new dataset was created by selecting columns from the old one, and the result was written to a new file with write.csv().
matpor <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=',')
colnames(matpor)
## [1] "school" "sex" "age" "address" "famsize" "Pstatus"
## [7] "Medu" "Fedu" "Mjob" "Fjob" "reason" "nursery"
## [13] "internet" "guardian" "traveltime" "studytime" "failures" "schoolsup"
## [19] "famsup" "paid" "activities" "higher" "romantic" "famrel"
## [25] "freetime" "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset consists of 35 variables. Many of the factors have only two levels and are thus binary; such factors include e.g. sex, address, famsize, schoolsup, famsup, paid, activities, higher, romantic and internet. Other factor variables include e.g. Mjob, Fjob, reason and guardian. Integer variables include e.g. age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, Dalc, Walc, health and absences; when such a variable ranges from 1 to 5, the scale runs from “very bad”/“very low” to “excellent”/“very high”.
There is also a logical variable high_use, which shows whether the average alcohol consumption is high (more than 2) or not.
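For reference, alc_use and high_use already come with the downloaded data; they were presumably constructed along the following lines (a sketch mirroring the course's wrangling step, with alc_use as the average of weekday and weekend consumption):
library(dplyr)
# alc_use: average of weekday (Dalc) and weekend (Walc) consumption
# high_use: TRUE when that average exceeds 2
matpor <- mutate(matpor, alc_use = (Dalc + Walc) / 2, high_use = alc_use > 2)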
I have chosen 4 variables: activities, famrel, higher and internet. The hypotheses below examine, one at a time, whether each of these variables is related to high alcohol consumption.
Plot for Hypothesis 1: At first glance it seems that high alcohol consumption is more common among students who don't take part in any activities. However, the data shows (both visually and numerically) that the number of students with high alcohol consumption is almost the same in both groups:
# students with high alcohol consumption, with and without activities
n1 <- nrow(filter(matpor, activities == "yes", high_use == TRUE))
n2 <- nrow(filter(matpor, activities == "no", high_use == TRUE))
The difference in low alcohol consumption between active and non-active students is also insignificant:
# students with low alcohol consumption, with and without activities
n3 <- nrow(filter(matpor, activities == "yes", high_use == FALSE))
n4 <- nrow(filter(matpor, activities == "no", high_use == FALSE))
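The same four counts can also be obtained at once with a cross-tabulation, a compact alternative to the filter() calls above:
table(activities = matpor$activities, high_use = matpor$high_use)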
Plot for Hypothesis 2:
It seems that most of the students ranked their family relationships as good or excellent (>250 students), which makes further analysis difficult. However, it is worth noting that the largest number of students with high alcohol consumption comes from families with good internal relationships.
Plot for Hypothesis 3:
There is also no correlation between striving for higher education and alcohol consumption, since almost every student wants to get a higher education.
Plot for Hypothesis 4:
Almost the same result as for Hypothesis 1: there seems to be no correlation between having Internet access and alcohol consumption. Only 15 students without Internet at home drink a lot, while 97 students who drink a lot do have Internet. However, in percentage terms, 26% of the students without Internet are high users, and the corresponding share among students with Internet is 30%. The same kind of filter as in Hypothesis 1 was used to obtain the counts:
i <- nrow(filter(matpor, internet == "yes", high_use == TRUE))
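The percentages can be computed directly with prop.table() over the row margins (the rounding here is mine):
round(100 * prop.table(table(matpor$internet, matpor$high_use), margin = 1), 1)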
m <- glm(high_use ~ activities + famrel + internet + higher, data = matpor, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ activities + famrel + internet + higher,
## family = "binomial", data = matpor)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3167 -0.8643 -0.7642 1.3863 1.8870
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9163 0.7070 1.296 0.1950
## activitiesyes -0.2410 0.2302 -1.047 0.2952
## famrel -0.2893 0.1213 -2.385 0.0171 *
## internetyes 0.2733 0.3297 0.829 0.4071
## higheryes -0.8244 0.4939 -1.669 0.0951 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 451.26 on 377 degrees of freedom
## AIC: 461.26
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) activitiesyes famrel internetyes higheryes
## 0.9162589 -0.2409905 -0.2893345 0.2732788 -0.8244081
OR <- exp(coef(m))
OR
## (Intercept) activitiesyes famrel internetyes higheryes
## 2.4999204 0.7858491 0.7487617 1.3142666 0.4384945
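Confidence intervals for the odds ratios could be added in one line; a sketch using the profile-likelihood intervals from confint() (this call is not part of the original analysis):
exp(cbind(OR = coef(m), confint(m)))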
# this chunk previously refused to render: the data frame in this chapter is
# called matpor, not alc, so the original code referenced a missing object
library(ggplot2)
probabilities <- predict(m, type = "response")
matpor <- mutate(matpor, probability = probabilities)
matpor <- mutate(matpor, prediction = probability > 0.5)
table(high_use = matpor$high_use, prediction = matpor$prediction)
g <- ggplot(matpor, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
The prediction is wrong in 119 cases. Only 5 cases are true positives (high use predicted and observed), while 258 cases are true negatives (low use predicted and observed).
# loss function: mean proportion of incorrect predictions
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(matpor$high_use, matpor$probability)

# 10-fold cross-validation with the same loss function
library(boot)
cv <- cv.glm(data = matpor, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
The number of wrong predictions on the test data is the same as on the training data.
Data wrangling exercises are done in the corresponding R script.
Firstly, the Boston dataset is loaded from the MASS package:
library(MASS)
library(dplyr)
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The data comprises 506 observations of 14 variables, all of them numeric or integer. The variables include e.g. per capita crime rate by town (crim), nitrogen oxides concentration (nox), index of accessibility to radial highways (rad), pupil-teacher ratio by town (ptratio), average number of rooms per dwelling (rm) and others.
summary(Boston)
## crim zn indus chas nox
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710
## rm age dis rad tax
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0
## ptratio black lstat medv
## Min. :12.60 Min. : 0.32 Min. : 1.73 Min. : 5.00
## 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02
## Median :19.05 Median :391.44 Median :11.36 Median :21.20
## Mean :18.46 Mean :356.67 Mean :12.65 Mean :22.53
## 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :22.00 Max. :396.90 Max. :37.97 Max. :50.00
rad (index of accessibility to radial highways) varies from 1 to 24. lstat (percentage of lower-status population) has several outliers: the minimum is 1.73%, the maximum 37.97%, and the mean 12.65%. There is also a noticeable outlier in the crim variable: max crim = 88.98. An outlier is present in the black variable as well (1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town): min black = 0.32.
pairs(Boston)
There is a clear hyperbolic relationship between nox and dis (weighted mean of distances to five Boston employment centres): the higher the concentration of nitrogen oxides, the smaller the weighted mean of distances to the employment centres. Thus nitrogen oxide concentration and distance to employment centres are correlated.
There is also a hyperbolic relationship between lstat and medv (median value of owner-occupied homes in $1000s): the larger the share of lower-status population, the lower the median home value in the area, which is quite logical.
There is an almost linear correlation between rm (average number of rooms per dwelling) and medv: the more rooms per dwelling, the higher the median home value, and vice versa.
Now the dataset will be standardized and a new variable will be added to the dataset (the old variable crim will be dropped):
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis rad
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat medv
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# creating a categorical variable from the scaled crim, using quantiles as breakpoints
breakpoints <- quantile(boston_scaled$crim)
labels <- c('low', 'med_low', 'med_high', 'high')
crime <- cut(boston_scaled$crim, breaks = breakpoints, include.lowest = TRUE, labels = labels)

# dropping the old crim variable and adding the new crime variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
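Since the breakpoints are the quantiles of crim, each class should contain roughly a quarter of the 506 towns; a quick sanity check (not in the original output):
table(boston_scaled$crime)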
Dividing the data into training (80%) and test (20%) sets:
n <- nrow(boston_scaled)
random_rows <- sample(n, size = n * 0.8)
train_data <- boston_scaled[random_rows,]
test_data <- boston_scaled[-random_rows,]
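Note that sample() is random, so the split, and with it the confusion matrix below, changes on every knit. For reproducibility one would set a seed before sampling; a minimal sketch (the seed value is arbitrary):
set.seed(2017)  # fix the RNG before drawing random_rows
random_rows <- sample(n, size = n * 0.8)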
Fitting and plotting the LDA:
lda.fit <- lda(crime ~ ., data = train_data)
classes <- as.numeric(train_data$crime)
plot(lda.fit, dimen = 2,col=classes,pch=classes)
Predicting the classes with the LDA model:
correct_classes <- test_data[,"crime"]
test_data <- dplyr::select(test_data, -crime)
lda.pred <- predict(lda.fit, newdata = test_data)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 14 2 0
## med_low 4 18 4 0
## med_high 0 8 16 2
## high 0 0 1 20
Overall, the high crime rate class was predicted well: 20 of its 21 cases were classified correctly (one was predicted as med_high), and only 2 med_high cases were wrongly predicted as high. The low class was predicted correctly in 13 cases, while 14 cases were misclassified as med_low and 2 as med_high. The med_low class was predicted correctly 18 times with 8 errors, and the med_high class 16 times with 10 errors.
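The overall accuracy on the test set can be computed directly by comparing the predicted classes with the true ones (a one-liner not in the original):
# proportion of correctly classified test observations
mean(correct_classes == lda.pred$class)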
Calculating the distances and visualizing the clusters:
library(MASS)
data('Boston')
boston_scaled_again <- scale(Boston)

# Euclidean distance matrix of the scaled data
dist_eu <- dist(boston_scaled_again)

# k-means with two clusters (note that it is run on the unscaled Boston data here)
km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
The optimal number of clusters is 2: three clusters already look reasonable, but one of them (the black one) doesn't seem to carry much significance, and more than three clusters is redundant. In the rad, tax and ptratio panels the red cluster clearly picks up the outliers. There are two overlapping clusters in the lstat vs medv plot: one corresponds to areas with a larger share of lower-status population and lower median home values, and the other to a higher-status population and higher median home values (basically, a poverty vs wealth dichotomy).
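The choice of two clusters can be backed up with the elbow method: compute the total within-cluster sum of squares (WCSS) for a range of cluster counts and look for the bend in the curve. A sketch, assuming the scaled data and an arbitrary seed:
set.seed(123)
# total WCSS for k = 1..10 clusters
twcss <- sapply(1:10, function(k) kmeans(boston_scaled_again, centers = k)$tot.withinss)
plot(1:10, twcss, type = 'b', xlab = 'number of clusters', ylab = 'total WCSS')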
Bonus Task:
model_predictors <- dplyr::select(train_data, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# plotly is assumed to be installed already; calling install.packages("plotly")
# while knitting raised "Error in install.packages : Updating loaded packages"
library(plotly)
#colors from the crime classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train_data$crime)
***
Data wrangling exercises are done in the corresponding R script.
First, the data is read from a csv file:
library(GGally)
human_data <- read.csv('data/human1.csv',row.names = 1)
summary(human_data)
## edu_ratio lab_ratio edu_exp life_exp gni
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00 Min. : 581
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20 Median : 12040
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65 Mean : 17628
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50 Max. :123124
## mat_mor adol_birth parl_perc
## Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 49.0 Median : 33.60 Median :19.30
## Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(human_data)
The plot above visualizes the distributions of the human_data variables and the pairwise relationships between them.
The plot below shows the correlations between variables:
cor(human_data) %>%
  corrplot::corrplot(method = 'pie', type = 'upper')
Let's take a look at edu_exp (expected years of schooling). It correlates highly positively with life_exp (life expectancy at birth) and with gni (Gross National Income per capita). On the other hand, it correlates highly negatively with mat_mor (maternal mortality ratio) and adol_birth (adolescent birth rate), which makes sense: the higher the quality of life in a country, the better its level of medical care. Maternal mortality also correlates highly with the adolescent birth rate.
However, parl_perc (percentage of female representatives in parliament) doesn't seem to correlate highly with anything. It correlates slightly with lab_ratio (ratio of females to males in the labour force) and edu_exp, but only weakly.
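These visual impressions can be checked against the numeric correlation matrix, e.g. for the edu_exp row (the rounding is mine):
round(cor(human_data)['edu_exp', ], 2)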
pca_human <- prcomp(human_data)
biplot(pca_human, choices = 1:2, cex=c(0.8,1), col=c("blue", "red"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length = arrow.len):
## zero-length arrow is of indeterminate angle and so skipped
Because the data is not standardized, only one variable (gni) effectively contributes to the PCA: the arrows of the other variables have zero length and are skipped, which is what the warning (emitted once per skipped arrow) is about. The gni arrow points towards the negative values of PC1 and zero on PC2, so we can say that the lower the value of PC1, the higher the GNI.
human_std <- scale(human_data)
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex=c(0.8,1), col=c("blue", "red"))
In contrast to the non-standardized data, all variables now contribute to the PCA.
The expected years of education, life expectancy, Gross National Income and the ratio of females to males with at least secondary education correlate with each other: the lower the value of PC1, the higher the values of these variables.
On the other hand, maternal mortality correlates with the adolescent birth rate (as was shown in the correlation plot): the higher the value of PC1, the higher the values of these variables.
There is also a correlation between the percentage of female representatives in parliament and the ratio of females to males in the labour force, though it is weaker than the aforementioned ones.
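How much of the total variance the first components capture can be read from summary(pca_human); a short sketch extracting the percentages (the rounding is mine):
s <- summary(pca_human)
round(100 * s$importance[2, ], 1)  # % of variance explained by each PC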
library(FactoMineR)
library(tidyr)
library(dplyr)   # select() is used below
library(ggplot2) # for the bar plots
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
reasons <- c('sex','spirituality', 'healthy','diuretic','exciting', 'relaxing', 'escape.exoticism')
tea_reason <- select(tea,one_of(reasons))
gather(tea_reason) %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap("key", scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
It seems that most respondents are women. Overall, respondents think that tea drinking is healthy and relaxing, but neither spiritual nor exciting.
mca <- MCA(tea_reason, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_reason, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.189 0.175 0.168 0.145 0.125 0.109 0.088
## % of var. 18.888 17.528 16.809 14.539 12.509 10.947 8.781
## Cumulative % of var. 18.888 36.416 53.225 67.764 80.272 91.219 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | 0.070 0.009 0.005 | 0.587 0.655 0.349 | -0.311 0.192 0.098 |
## 2 | 0.562 0.556 0.332 | -0.426 0.346 0.191 | -0.324 0.208 0.111 |
## 3 | -0.504 0.448 0.401 | -0.135 0.034 0.029 | -0.537 0.572 0.456 |
## 4 | -0.417 0.306 0.156 | -0.013 0.000 0.000 | 0.784 1.219 0.551 |
## 5 | -0.088 0.014 0.006 | -0.136 0.035 0.014 | 0.774 1.188 0.463 |
## 6 | -0.432 0.329 0.223 | 0.544 0.562 0.353 | -0.021 0.001 0.001 |
## 7 | -0.432 0.329 0.223 | 0.544 0.562 0.353 | -0.021 0.001 0.001 |
## 8 | -0.582 0.598 0.467 | 0.345 0.227 0.164 | -0.287 0.163 0.113 |
## 9 | -0.182 0.059 0.030 | 0.900 1.542 0.731 | 0.220 0.096 0.044 |
## 10 | -0.353 0.220 0.168 | 0.064 0.008 0.005 | -0.271 0.146 0.099 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## F | -0.186 1.553 0.050 -3.885 | -0.237 2.707 0.082 -4.942 |
## M | 0.271 2.266 0.050 3.885 | 0.345 3.950 0.082 4.942 |
## Not.spirituality | 0.032 0.053 0.002 0.815 | 0.377 7.963 0.312 9.656 |
## spirituality | -0.070 0.115 0.002 -0.815 | -0.827 17.451 0.312 -9.656 |
## healthy | -0.228 2.753 0.121 -6.023 | -0.314 5.614 0.230 -8.285 |
## Not.healthy | 0.532 6.425 0.121 6.023 | 0.732 13.099 0.230 8.285 |
## diuretic | 0.100 0.442 0.014 2.041 | -0.591 16.484 0.482 -11.999 |
## Not.diuretic | -0.139 0.611 0.014 -2.041 | 0.815 22.764 0.482 11.999 |
## exciting | 0.959 26.915 0.580 13.171 | -0.340 3.634 0.073 -4.662 |
## No.exciting | -0.605 16.968 0.580 -13.171 | 0.214 2.291 0.073 4.662 |
## Dim.3 ctr cos2 v.test
## F -0.310 4.860 0.141 -6.484 |
## M 0.453 7.090 0.141 6.484 |
## Not.spirituality -0.383 8.573 0.322 -9.811 |
## spirituality 0.840 18.788 0.322 9.811 |
## healthy -0.207 2.550 0.100 -5.468 |
## Not.healthy 0.483 5.950 0.100 5.468 |
## diuretic -0.302 4.499 0.126 -6.139 |
## Not.diuretic 0.417 6.212 0.126 6.139 |
## exciting 0.221 1.599 0.031 3.029 |
## No.exciting -0.139 1.008 0.031 -3.029 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.050 0.082 0.141 |
## spirituality | 0.002 0.312 0.322 |
## healthy | 0.121 0.230 0.100 |
## diuretic | 0.014 0.482 0.126 |
## exciting | 0.580 0.073 0.031 |
## relaxing | 0.548 0.004 0.163 |
## escape.exoticism | 0.005 0.046 0.294 |
plot(mca, invisible=c("ind"), habillage = "quali")
desc_dim <- dimdesc(mca, axes = c(1,2))
# Description of dimension 1
desc_dim[[1]]
## $quali
## R2 p.value
## exciting 0.58019625 4.126097e-58
## relaxing 0.54849837 2.177383e-53
## healthy 0.12134448 5.514632e-10
## sex 0.05048393 8.646168e-05
## diuretic 0.01392559 4.109741e-02
##
## $category
## Estimate p.value
## exciting=exciting 0.33988361 4.126097e-58
## relaxing=No.relaxing 0.33213016 2.177383e-53
## healthy=Not.healthy 0.16518089 5.514632e-10
## sex=M 0.09939564 8.646168e-05
## diuretic=diuretic 0.05195503 4.109741e-02
## diuretic=Not.diuretic -0.05195503 4.109741e-02
## sex=F -0.09939564 8.646168e-05
## healthy=healthy -0.16518089 5.514632e-10
## relaxing=relaxing -0.33213016 2.177383e-53
## exciting=No.exciting -0.33988361 4.126097e-58
##
## attr(,"class")
## [1] "condes" "list "
# Description of dimension 2
desc_dim[[2]]
## $quali
## R2 p.value
## diuretic 0.48155717 2.051296e-44
## spirituality 0.31181525 5.400472e-26
## healthy 0.22959277 1.260640e-18
## sex 0.08167713 4.783524e-07
## exciting 0.07269802 2.151942e-06
## escape.exoticism 0.04581619 1.874833e-04
##
## $category
## Estimate p.value
## diuretic=Not.diuretic 0.29432081 2.051296e-44
## spirituality=Not.spirituality 0.25200437 5.400472e-26
## healthy=Not.healthy 0.21887953 1.260640e-18
## sex=M 0.12179157 4.783524e-07
## exciting=No.exciting 0.11589917 2.151942e-06
## escape.exoticism=Not.escape-exoticism 0.08974158 1.874833e-04
## escape.exoticism=escape-exoticism -0.08974158 1.874833e-04
## exciting=exciting -0.11589917 2.151942e-06
## sex=F -0.12179157 4.783524e-07
## healthy=healthy -0.21887953 1.260640e-18
## spirituality=spirituality -0.25200437 5.400472e-26
## diuretic=diuretic -0.29432081 2.051296e-44
##
## attr(,"class")
## [1] "condes" "list "
Not.diuretic, exciting and spirituality seem to be highly discriminating categories, since they lie far from the origin.
Another plot, with confidence ellipses:
factoextra::fviz_mca_ind(mca,
label = "none",
habillage = "sex",
palette = c("#00AFBB", "#E7B800"),
addEllipses = TRUE, ellipse.type = "confidence",
ggtheme = theme_minimal())
Data wrangling exercises are done in the corresponding R script.
First, the data is read from a csv file:
library(dplyr)
library(tidyr)
library(ggplot2)
RATS <- read.csv("data/RATSL.csv")
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
Now let’s draw the plot:
ggplot(RATS, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS$Weight), max(RATS$Weight)))
There are 8 rats in Group 1 and 4 rats each in Group 2 and Group 3. For Groups 1 and 3 the weight stays within a certain range: for Group 1, from 225 to almost 300 g; for Group 3, from about 460 to 560 g. However, there is an outlier in Group 2: one rat weighed about 500 g at the start and gained roughly 80 g. Two other rats in Group 2 also gained weight, while the weight of the remaining one barely changed. It is also worth noting that there are shared patterns in weight gain: e.g. in Group 3 there is a dip that appears for all four rats.
Let’s draw the plot for the same data, but standardised:
RATS <- RATS %>%
group_by(Time) %>%
mutate(stdweight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
ggplot(RATS, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "right") +
scale_y_continuous(name='standardized weight')
After standardisation, Group 1 lies below 0 (around -1), Group 2 slightly above 0 (except for the outlier), and Group 3 around 1.
Let’s look at average profiles for each group:
n <- RATS$Time %>% unique() %>% length()
RATS1 <- RATS %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = (sd(Weight))/sqrt(n)) %>%
ungroup()
ggplot(RATS1, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
The patterns of weight change are similar for Group 2 and Group 3: e.g. around day 45 there is a decrease in weight, but by the next measurement it has risen again and continues rising. It is also worth noting that the weights in Group 2 are the most spread out, while Group 1 shows the least variation.
# analysis of covariance: the weight on day 1 is used as a baseline covariate
# (a two-sample t-test is not applicable here, since there are three groups,
# so t.test(mean ~ Group) raises an error)
RATS2 <- RATS %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight)) %>%
  ungroup()

# add the day-1 weight from the wide-form data as the baseline
RATS1 <- read.csv("data/RATS.csv")
RATS2 <- RATS2 %>%
  mutate(baseline = RATS1$WD1)
fit <- lm(mean ~ baseline + Group, data = RATS2)
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
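Judging by the table, the baseline weight is a highly significant predictor of the mean weight, while Group falls just short of significance at the 5% level (p ≈ 0.076): conditional on the starting weight, the evidence for a group difference is weak.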
First, the data is read from a csv file:
library(dplyr)
library(tidyr)
library(ggplot2)
BPRS <- read.csv("data/BPRSL.csv")
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
Now let’s draw a plot of the data:
ggplot(BPRS, aes(x=subject, y=bprs, color=treatment)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Creating a regression model:
BPRS_reg <- lm(bprs ~ week + treatment, BPRS)
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The model suggests that the average BPRS at the start is 46.45 and that with every week the BPRS dropped by about 2.27 points on average, while treatment 2 is associated with a 0.57-point increase. However, the t-value of treatment2 is close to zero, so we cannot reject the null hypothesis that there is no relationship between treatment 2 and bprs. The situation is the opposite for the week variable: judging by its t-value and Pr(>|t|), we can reject the null hypothesis, so there is a relationship between the two variables. Note, however, that this model treats the repeated measurements on the same subject as independent, which is unrealistic; the mixed models below relax that assumption.
Creating a random intercept model:
library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Creating a random intercept and random slope model:
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
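The likelihood ratio test favours the random intercept and slope model over the plain random intercept model (chi-squared = 7.27 on 2 df, p ≈ 0.026), so allowing subject-specific slopes improves the fit.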
Creating a model with interaction:
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + (week*treatment), data = BPRS, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + (week * treatment)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRS
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + (week * treatment)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
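Here the interaction model improves the fit only marginally (chi-squared = 3.17 on 1 df, p ≈ 0.075), so the week-by-treatment interaction does not add much on top of the random intercept and random slope model.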